ABSTRACT

By  extending  the  concepts  used  in  contemporary  two-level  hierarchical 
storage  systems,  such  as  cache  or  paging  systems,  it  is  possible  to  develop 
an  orderly  strategy  for  the  design  of  large-scale  automated  hierarchical 
storage  systems.   In  this  paper  systems  encompassing  up  to  six  levels  of 
storage  technology  are  studied.   Specific  techniques,  such  as  page 
splitting,  shadow  storage,  direct  transfer,  read  through,  store  behind, 
and  distributed  control,  are  discussed  and  are  shown  to  provide  considerable 
advantages  for  increased  systems  performance  and  reliability. 
INTRODUCTION

The evolution of computer systems has been marked
by a continually increasing demand for faster, larger,
and more economical storage facilities. Clearly, if
we had the technology to produce ultra-fast,
limitless-capacity storage devices for minuscule cost, the
demand could be easily satisfied. Because no such
wondrous device is currently available, the
requirements of high performance yet low cost are
best satisfied by a mixture of technologies combining
expensive high-performance devices with inexpensive
lower-performance devices. Such a strategy is often
called a hierarchical storage system or multilevel
storage system.


CONTEMPORARY  HIERARCHICAL  STORAGE  SYSTEMS 

Range  of  Storage  Technologies 

Table I indicates the range of performance and
cost characteristics for typical current-day storage
technologies, divided into 6 cost-performance levels.
Although this table is only a simplified summary, it
does illustrate the fact that there exists a spectrum
of devices that spans over 6 orders of magnitude
(1,000,000) in both cost and performance.

Locality  of  Reference 

If all references to online storage were equally
likely, the use of hierarchical storage systems would
be of marginal value at best. For example, consider
a hierarchical system, M', consisting of two types of
devices, M(1) and M(2), with access times A(1) = .05 x A(2)
and costs C(1) = 20 x C(2), respectively. If the total
storage capacity S' were equally divided between M(1)
and M(2), i.e., S(1) = S(2), then we can determine an
overall cost per byte, C', and effective access time, A':


     C' = .5 x C(1) + .5 x C(2)

     C' = 10.5 x C(2) = .525 x C(1)


     A' = .5 x A(1) + .5 x A(2)

     A' = .525 x A(2) = 10.5 x A(1)

From these figures, we see that the effective
system, M', costs about half as much as M(1) but is
more than 10 times as slow. On the other hand, M'
is almost twice as fast as M(2), but it costs more
than 10 times as much per byte.
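
The arithmetic of this equal-split example can be checked with a short sketch. The function name and the normalization C(2) = A(2) = 1 are illustrative assumptions, not part of the original analysis.

```python
def equal_split(cost_ratio=20.0, access_ratio=0.05):
    """Effective cost and access time when S(1) = S(2) and all
    references are equally likely.

    Units are normalized so that C(2) = 1 and A(2) = 1; this
    normalization is an assumption made only for illustration.
    """
    c1, c2 = cost_ratio, 1.0        # C(1) = 20 x C(2)
    a1, a2 = access_ratio, 1.0      # A(1) = .05 x A(2)
    cost = 0.5 * c1 + 0.5 * c2      # C', in units of C(2)
    access = 0.5 * a1 + 0.5 * a2    # A', in units of A(2)
    # Also report C' relative to C(1) and A' relative to A(1).
    return cost, access, cost / c1, access / a1


cost, access, cost_vs_m1, access_vs_m1 = equal_split()
print("C' in units of C(2):", cost)
print("A' in units of A(2):", access)
print("C' in units of C(1):", cost_vs_m1)
print("A' in units of A(1):", access_vs_m1)
```

With the ratios given in the text, the sketch reproduces the factors of roughly 10.5 and .525 quoted above.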

Fortunately, most actual programs and applications
cluster their references so that, during any
interval of time, only a small subset of the
information is actually used. This phenomenon is known
as locality of reference. If we could keep the
"current" information in a small amount of M(1), we
could produce hierarchical storage systems with an
effective access time close to M(1) but an overall
cost per byte close to M(2).

The actual performance attainable by such
hierarchical storage systems is addressed by the other
papers in this session [12, 14, 20, 23] and the general
literature [2, 3, 4, 6, 7, 8, 10, 11, 13, 15, 18, 19, 21, 22].
In many actual systems it has been possible to find
over 90% of all references in M(1) even though M(1)
was much smaller than M(2) [13, 15].
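
The effect of locality can be illustrated with a simple weighted model. This is only a sketch: the weighting below ignores the time spent moving pages between levels, and the 1% capacity share assigned to M(1) is a hypothetical figure chosen for illustration.

```python
def effective_access(hit_ratio, a1, a2):
    """Weighted access time when a fraction hit_ratio of references is
    found in M(1) and the remainder in M(2). Assumed model: the cost of
    moving pages between the levels is ignored."""
    return hit_ratio * a1 + (1.0 - hit_ratio) * a2


def cost_per_byte(fraction_in_m1, c1, c2):
    """Overall cost per byte when M(1) holds the given fraction of the
    total capacity."""
    return fraction_in_m1 * c1 + (1.0 - fraction_in_m1) * c2


# Same devices as the example in the text, normalized to A(2) = C(2) = 1:
# A(1) = .05 x A(2) and C(1) = 20 x C(2).
access = effective_access(0.90, a1=0.05, a2=1.0)
cost = cost_per_byte(0.01, c1=20.0, c2=1.0)   # 1% share is hypothetical
print("effective access time, in units of A(2):", access)
print("cost per byte, in units of C(2):", cost)
```

Under these assumptions a 90% hit ratio brings the access time to roughly .145 x A(2) while the cost stays near 1.19 x C(2), far better on both counts than the equal split.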

Automated  Hierarchical  Storage  Systems 

There are at least three ways in which the
locality of reference can be exploited in a
hierarchical storage system: static, manual, or automatic.

If  the  pattern  of  reference  is  constant  and 
predictable  over  a  long  period  of  time  (e.g.,  days, 
weeks,  or  months),  the  information  can  be  statically 
allocated  to  specific  storage  devices  so  that  the 
most  frequently  referenced  information  is  assigned  to 
the  highest  performance  devices. 


Storage Level   Random        Transfer Rate    [heading    Technology
                Access Time   (bytes/second)   missing]

1. Cache        50 ns         100M             [missing]   Semiconductor RAM
2. Main         1 us          16M              [missing]   Semiconductor RAM, ferrite core
3. Block        50 us         8M               [missing]   Semiconductor shift registers,
                                                           bulk ferrite core, charge-coupled
                                                           devices, magnetic bubbles
4. [missing]    [missing]     [missing]        [missing]   Fixed-head disks and drums,
                                                           charge-coupled devices,
                                                           magnetic bubbles
5. Secondary    50 ms         [missing]        .01C        Moving-head disks
6. Mass         1 sec         [missing]        .005C       Automated tape-handlers,
                                                           laser devices

Table I.
Spectrum of Storage Device Technologies


If the pattern of reference is not constant but
is predictable, the programmer can manually (i.e., by
explicit instructions in the program) cause frequently
referenced information to be moved onto
high-performance devices while being used. Such manual control
places a heavy burden on the programmer to understand
the detailed behavior of the application and to
determine the optimum way to utilize the storage
hierarchy. This can be very difficult if the
application is complex (e.g., a multiple access data base
system) or is operating in a multiprogramming
environment with other applications.

In an automated storage hierarchy, all aspects
of the physical information organization and
distribution and movement of information are removed from
the programmer's responsibility. The algorithms that
manage the storage hierarchy may be implemented in
hardware, firmware, system software, or some combination.

Contemporary  Automated  Hierarchical  Storage  Systems 

To date, most implementations of automated
hierarchical storage systems have been large-scale
cache systems or paging systems. Typically, cache
systems, using a combination of level 1 and level 2
storage devices managed by specialized hardware, have
been used in large-scale high-performance computer
systems [4, 10, 23]. Paging systems, [...] storage
devices managed mostly by system software with
limited support from specialized hardware and/or
firmware [..., 11, 12].

Impact of New Technology

As a result of new storage technologies,
especially for levels 3 and 6, as well as substantially
reduced costs for implementing the necessary hardware
control functions, many new types of hierarchical
storage systems have evolved. Several of these
advances are discussed in this session.

Bill Strecker [23] explains how the cache system
concept  can  be  extended  for  use  in  high-performance 
minicomputer  systems. 

Bernard Greenberg and Steven Weber [12] discuss the
design and performance of a three-level paging system
used in the Multics Operating System (levels 2, 4, and
5 in the earlier Honeywell 645 implementation; levels
2, 3, and 5 in the more recent Honeywell 68/80
implementation).

Suran Ohnigian [20] analyzes the handling of data
base applications in a storage hierarchy.

Clay Johnson [14] explains how a new level 6
technology is being used in conjunction with level 5
devices to form a very large capacity hierarchical
storage system, the IBM 3850 Mass Storage System.

The  various  automated  hierarchical  storage  systems 
discussed  in  this  session  are  depicted  in  Figure  1. 
Figure 1

Various Automated Hierarchical
Storage Systems

GENERAL  HIERARCHICAL  STORAGE  SYSTEM 

All of the automated hierarchical storage systems
described thus far have only dealt with two or three
levels of storage devices (the Honeywell 68/80 Multics
system actually encompasses four levels since it
contains a cache memory which is partially controlled
by the system software). Furthermore, many of these
systems were initially conceived and designed several
years ago without the benefit of our current knowledge
about memory and processor capabilities and costs. In
this section the structure of a general automated
hierarchical storage system is outlined (see Figure 2).

Continuous and Complete Hierarchy

An  automated  storage  hierarchy  must  consist  of  a 
continuous  and  complete  hierarchy  that  encompasses  the 
full  range  of  storage  levels.   Otherwise,  the  user 
would  be  forced  to  rely  on  manual  or  semi-automatic 
storage  management  techniques  to  deal  with  the  storage 
levels  that  are  not  automatically  managed. 

Page  Splitting  and  Shadow  Storage 

The average time, Tm, required to move a page, a
unit of information, between two levels of the
hierarchy consists of the sum of (1) the average access
time, T, and (2) the transfer time, which is the
product of the transfer rate, R (expressed as time per
byte), and the size of a page, N. Thus, Tm = T + R x N.


Figure 2

General Automated
Hierarchical Storage System


Choice of Page Size. By examining the representative
devices indicated in Table I, we see that access
times vary much more than transfer rates. As a result,
Tm is very sensitive to the transfer unit size, N, for
devices with small values of T (e.g., levels 1 and 2).
Thus, for these devices small page sizes such as N=16
or 32 bytes are typically used. Conversely, for
devices with large values for T, the marginal cost of
increasing the page size is quite small, especially in
view of the benefits possible due to locality of
reference. Thus, page sizes such as N=4096 bytes are
typical for levels 4 and 5. Much larger units of
information transfer are used for level 6 devices.
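
The sensitivity of the move time T + R x N to page size can be illustrated with the level 1 and level 3 figures from Table I. This is a sketch under two assumptions: R is taken as time per byte (the reciprocal of the tabulated transfer rate), and "M" in the table is read as 10^6 bytes per second.

```python
def move_time(access_s, rate_bytes_per_s, page_bytes):
    """Move time T + R x N, with R = 1 / rate taken as seconds per byte."""
    return access_s + page_bytes / rate_bytes_per_s


# (T in seconds, transfer rate in bytes/second), read from Table I.
levels = {
    "1. Cache": (50e-9, 100e6),   # 50 ns, 100M bytes/second
    "3. Block": (50e-6, 8e6),     # 50 us, 8M bytes/second
}

for name, (t, rate) in levels.items():
    small = move_time(t, rate, 16)      # a 16-byte page
    large = move_time(t, rate, 4096)    # a 4096-byte page
    print(name, "16 bytes:", small, "4096 bytes:", large,
          "ratio:", large / small)
```

Under these assumptions, going from 16 to 4096 bytes multiplies the level 1 move time by a factor of nearly 200, but the level 3 move time by only about 11, which is why larger pages become attractive lower in the hierarchy.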

Since the marginal increase in Tm decreases
monotonically as a function of storage level, the number of
bytes transferred between levels should increase
correspondingly. In order to simplify the
implementation of the system and to be consistent with the
normal mappings from processor address references to
page addresses, it is desirable that all page sizes be
a power of two. The choice for each page size depends
upon the characteristics of the programs to be run and
the effectiveness of the overall storage system.
Preliminary measurements indicate that a ratio of 4:1 or
8:1 between levels is reasonable. Meade [?] has reported
similar findings. Figure 3 indicates possible transfer
unit sizes between the 6 levels presented in Table I.
The particular choice of size will, of course, depend
upon the particulars of the devices.


Figure  3 
Sample Transfer Unit Sizes
[...] movement of information in the storage hierarchy. At
time t the processor generates a reference to
logical address a. Assume that the corresponding
information is not currently stored in M(1) or M(2)
but is found in M(3). For simplicity, assume that
page sizes are doubled as we go down the hierarchy
(i.e., N(2,3) = 2 x N(1,2), N(3,4) = 2 x N(2,3) = 4 x N(1,2), etc.,
see Figure 4). The page of size N(2,3) containing
a is copied from M(3) to M(2). M(2) now contains the
needed information, so we repeat the process. The
page of size N(1,2) containing a is copied from M(2)
to M(1). Now, finally, the page of size N(0,1)
containing a is copied from M(1) and forwarded to the
processor. In the process the page of information is
split repeatedly as it moves up the hierarchy.
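
The page-splitting walk described above can be sketched as follows. The function names, the list layout of the sizes, and the particular byte sizes are hypothetical; the sketch only assumes that each page size is a power of two and that pages are aligned on multiples of their size.

```python
def page_containing(address, page_size):
    """Return (start, end) of the aligned page of page_size bytes that
    holds address; end is exclusive."""
    start = (address // page_size) * page_size
    return start, start + page_size


def split_path(address, found_level, sizes):
    """List the pages copied upward when address is found at found_level.

    sizes[k] stands for N(k, k+1), the unit moved from level k+1 up to
    level k (level 0 is the processor). Each entry of the result is
    (source level, destination level, (start, end)).
    """
    steps = []
    for upper in range(found_level - 1, -1, -1):
        page = page_containing(address, sizes[upper])
        steps.append((upper + 1, upper, page))
    return steps


# Hypothetical sizes that double per level: N(0,1)=8, N(1,2)=16, N(2,3)=32.
for src, dst, (lo, hi) in split_path(1000, found_level=3, sizes=[8, 16, 32]):
    print("level", src, "to level", dst, "bytes", lo, "up to", hi)
```

Each step selects a subset of the page moved in the previous step, so the unit of information shrinks as it approaches the processor.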
[...] pages are actually copied as they move up the hierarchy.
Page Splitting and
Shadow  Storage 


Direct  Transfer 


In the descriptions above it is implied that
information is directly transferred between adjacent
levels. By comparison, most present multiple level
storage systems are based upon an indirect transfer
(e.g., the Multics Page Multilevel system [12]). In an
indirect transfer system, all information is routed
through level 1 or level 2. For example, to move a
page from level n-1 to level n, the page is moved from
level n-1 to level 1 and then from level 1 to level n.
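
The difference in page movement between the two schemes can be shown with a small sketch. The function is hypothetical and, following the example in the text, routes every indirect transfer through level 1.

```python
def transfers(src, dst, direct):
    """Return the list of (from level, to level) moves needed to take a
    page from level src to level dst.

    With direct transfer a single move suffices. With indirect transfer
    the page is routed through level 1, unless level 1 is already one of
    the two endpoints.
    """
    if direct or 1 in (src, dst):
        return [(src, dst)]
    return [(src, 1), (1, dst)]


print(transfers(4, 5, direct=True))    # one move between adjacent levels
print(transfers(4, 5, direct=False))   # two moves, both occupying level 1
```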

Clearly, an indirect approach is undesirable
since it requires extra page movement and its
associated overhead as well as consuming a portion of the
limited M(1) capacity in the process. In the past
such indirect schemes were necessary since all
interconnections were radial from the main memory or
processor to the storage devices. Thus, all transfers
had to be routed through the central element.
Furthermore, due to the differences in transfer rate between
devices, especially electromechanical devices that
must operate at a fixed transfer rate, direct transfer
was not possible.

Based upon current technology, these problems can
be solved. Many of the newer storage devices are
non-electromechanical (i.e., strictly electrical). In
these cases the transfer rates can be synchronized to
allow direct transfer. For electromechanical devices,
direct transfer is possible if transfer rates are
similar or if a small scratchpad buffer, such as low-cost
semiconductor shift registers, is used to enable
synchronization between the devices. Direct transfer
is provided between the level 6 and level 5 storage
devices in the IBM 3850 Mass Storage System [14].

Read  Through 

A possible implementation of the page splitting
strategy could be accomplished by transferring
information from level i to the processor (level 0) in a
series of sequential steps: (1) transfer a page of size
N(i-1,i) from level i to level i-1, and then (2)
extract the appropriate page subset of size N(i-2,i-1)
and transfer it from level i-1 to level i-2, etc.
Under such a scheme, a transfer from level i to the
processor would consist of a series of i steps. This
would result in a considerable delay, especially for
certain devices that require a large delay for reading
recently written information (e.g., rotational delay
to reposition electromechanical devices).

The inefficiencies of the sequential transfers
can be avoided by having the information read through
to all the upper levels simultaneously. For example,
assume the processor references logical address a which
is currently stored at level i (and all lower levels,
of course). The controller for M(i) outputs the
N(i-1,i) bytes that contain a onto the data buses
along with their corresponding addresses. The
controller for M(i-1) will accept all of these N(i-1,i)
bytes and store them in M(i-1). At the same time,
the controller for M(i-2) will extract the particular
N(i-2,i-1) bytes that it wants and store them in
M(i-2). In a like manner, all of the controllers can
simultaneously extract the portion that is needed.
Using this mechanism, the processor (level 0) extracts
the N(0,1) bytes that it needs without any further
delay; the information from level i is thus read
through to the processor. This
process is illustrated in Figure 5.
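
The read-through mechanism can be sketched as a single broadcast from which every higher level keeps its own slice. The function name, the sizes, and the use of a byte string to stand in for the contents of M(i) are assumptions for illustration; the loop only models controllers that, in hardware, would act in parallel.

```python
def read_through(address, found_level, sizes, memory):
    """Broadcast one block from M(found_level); each higher level keeps
    the aligned slice it wants.

    sizes[k] stands for N(k, k+1). memory is a bytes object standing in
    for the contents of M(found_level). Returns a dict mapping each
    higher level (0 is the processor) to (start address, data).
    """
    block = sizes[found_level - 1]
    bus_start = (address // block) * block
    bus_data = memory[bus_start:bus_start + block]   # placed on the data buses
    captured = {}
    for level in range(found_level - 1, -1, -1):
        want = sizes[level]                          # bytes this level accepts
        start = (address // want) * want
        offset = start - bus_start
        captured[level] = (start, bus_data[offset:offset + want])
    return captured


memory = bytes(range(256)) * 8                       # 2048 bytes of sample data
result = read_through(1000, found_level=3, sizes=[8, 16, 32], memory=memory)
for level in sorted(result, reverse=True):
    start, data = result[level]
    print("level", level, "keeps", len(data), "bytes starting at", start)
```

Because every level works from the same broadcast, removing one level from the loop leaves the others unaffected, which mirrors the serviceability argument made for this mechanism.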

The  read  through  mechanism  also  offers  some 
important  reliability,  availability,  and  serviceability 
advantages.   Since  all  storage  levels  communicate 
anonymously  over  the  data  buses,  if  a  storage  level 
must  be  removed  from  the  system,  there  are  no  changes 
needed.   In  this  case,  the  information  is  "read 
through"  this  level  as  if  it  didn't  exist.   No  changes 
are  needed  to  any  of  the  other  storage  levels  or  the 
storage  management  algorithms  although  we  would  expect 
the  performance  to  decrease  as  a  result  of  the  missing 
storage level. This reliability strategy is employed
in most current-day cache memory systems [?].

Store  Behind 

Under steady-state operation, we would expect all
the levels of the storage hierarchy to be full (with
the exception of the lowest level, L). Thus, whenever
a page is to be moved into a level, it is necessary
to remove a current page. If the page selected for
removal has not been changed by means of a processor
write, the new page can be immediately stored into
the level since a copy of the page to be removed
already exists in the next lower level of the
hierarchy. But, if the processor performs a write
operation, all levels that contain a copy of the
information being modified must be updated. This can be
accomplished in at least three basic ways:
(1) store through, (2) store replacement, or (3)
store behind.


Figure 5

Read-Through Example
(Data transferred from
M(3) to M(2), M(1),
and processor
simultaneously)


Store  Through.   Under  a  store  through  strategy, 
all  levels  are  updated  simultaneously.   This  is  the 
logical  inverse  of  the  read  through  strategy  but  it 
has  a  crucial  distinction.   The  store  through  is 
limited by the speed of the lowest level, L, of the
hierarchy,  whereas  the  read  through  is  limited  by  the 
speed  of  the  highest  level  containing  the  desired 
information.   Store  through  is  efficient  only  if  the 
access  time  of  level  L  is  comparable  to  the  access 
time  of  level  1,  such  as  in  a  two-level  cache  system 
(e.g.,  it  is  used  in  the  IBM  System/370  Models  158 
and  168). 
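
The asymmetry between store through and read through can be made concrete with a small sketch. The functions are hypothetical and use only access times; the figures are the level 1, 2, and 3 access times from Table I.

```python
def store_through_time(access_times):
    """All levels are updated together, so the slowest level, L, sets
    the time for the whole write."""
    return max(access_times)


def read_through_time(access_times, found_level):
    """A read completes at the speed of the highest level that holds
    the information. found_level is 1-based."""
    return access_times[found_level - 1]


two_level = [50e-9, 1e-6]             # cache and main storage
three_level = [50e-9, 1e-6, 50e-6]    # adds a level 3 block store

print(store_through_time(two_level))       # bounded by main storage
print(store_through_time(three_level))     # bounded by the block store
print(read_through_time(three_level, 1))   # a hit in level 1 stays fast
```

Adding a slower level leaves the time for a read that hits level 1 unchanged but lengthens every store, which is why store through suits only hierarchies whose lowest level is nearly as fast as level 1.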

Store Replacement. Under a store replacement
strategy, the processor only updates M(1) and marks
the page as "changed." If such a changed page is
later selected for removal from M(1), it is then moved
to the next lower level, M(2), immediately prior to
being replaced. This process occurs at every level
and, eventually, level L will be updated but only
after the page has been selected for removal from all
the higher levels. Due to the extra delays caused by
updating changed pages before replacement, the
effective access time for reads is increased. Various
versions of store replacement are used in most
two-level paging systems since it offers substantially
better performance than store through for slow storage
devices (e.g., disks and drums).
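
A minimal sketch of store replacement follows. The class and its names are hypothetical, and least-recently-used replacement is an assumption made only to have some removal policy; the text does not prescribe one.

```python
from collections import OrderedDict


class Level:
    """One storage level holding a fixed number of pages."""

    def __init__(self, capacity, lower=None):
        self.capacity = capacity
        self.lower = lower             # next lower level, or None for level L
        self.pages = OrderedDict()     # page id -> (data, changed flag)
        self.writebacks = 0

    def store(self, page, data, changed=True):
        """Place a page in this level, removing another page if full."""
        if page not in self.pages and len(self.pages) >= self.capacity:
            self._remove_one()
        self.pages[page] = (data, changed)
        self.pages.move_to_end(page)   # mark as most recently used

    def _remove_one(self):
        victim, (data, changed) = self.pages.popitem(last=False)
        if changed and self.lower is not None:
            # Store replacement: update the next lower level immediately
            # before the page is replaced. Unchanged pages are dropped.
            self.lower.store(victim, data, changed=True)
            self.writebacks += 1


m2 = Level(capacity=10)
m1 = Level(capacity=2, lower=m2)
m1.store("p0", "a")                    # changed by a processor write
m1.store("p1", "b", changed=False)     # unchanged copy
m1.store("p2", "c")                    # removes p0, which is written to M(2)
m1.store("p3", "d")                    # removes p1, no write needed
print(m1.writebacks, list(m2.pages))
```

The write to M(2) happens on the path of the store that triggered the removal, which is the extra delay the text attributes to this strategy.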


Store Behind. In both strategies above, the
storage system was required to perform the update
operation at some specific time, either at the time of
write or replacement. Once the information has been
stored into M(1), the processor doesn't really care
how or when the other copies are updated. Store
behind takes advantage of this degree of freedom.
The maximum transfer capability between levels is
rarely maintained since, at any instant of time, a
storage level may not have any outstanding requests
for service or it may be waiting for proper
positioning to service a pending request. During these "idle"
periods, data can be transferred down to the next
lower level of the storage hierarchy without adding
any significant delays to the read or store operations.
Since these "idle" periods are usually quite frequent
under normal operation, there can be a continual flow
of changed information down through the storage
hierarchy.
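
Store behind can be sketched as a queue of changed pages that is drained only when the level has nothing else to do. The class, its method names, and the use of a dictionary to stand in for the next lower level are hypothetical.

```python
from collections import deque


class StoreBehindLevel:
    """A level that queues changed pages and copies them down when idle."""

    def __init__(self, lower):
        self.lower = lower          # dict standing in for the next lower level
        self.pages = {}
        self.pending = deque()      # changed pages not yet copied down

    def write(self, page, data):
        """Processor write: only this level is updated immediately."""
        self.pages[page] = data
        if page not in self.pending:
            self.pending.append(page)

    def idle_tick(self):
        """Called when no request is outstanding; copies one page down.
        Returns True if a page was transferred."""
        if not self.pending:
            return False
        page = self.pending.popleft()
        self.lower[page] = self.pages[page]
        return True


lower = {}
level = StoreBehindLevel(lower)
level.write("p", 1)
level.write("p", 2)           # a second write before the copy goes down
print(lower)                  # still empty: nothing has been forced down
level.idle_tick()
print(lower)                  # the latest contents of the page arrive
```

Two writes to the same page before an idle period result in a single transfer, so the lower level sees less traffic than it would under store through.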

A variation of this store behind strategy is used
in the Multics Page Multilevel system [12] whereby a
certain number of pages at each level are kept
available for immediate replacement. If the number of
replaceable pages drops below the threshold, changed
pages are selected to be updated to the next lower
level. The actual writing of these pages to the lower
level is scheduled to take place when there is no
real request to be done.
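
The threshold rule can be sketched as below. This is an illustration of the idea only, not the Multics implementation; the function name and the choice of which changed pages to select first are assumptions.

```python
def pages_to_update(pages, threshold):
    """Select changed pages to schedule for writing to the next lower level.

    pages maps page id to its changed flag. A page is replaceable when
    it is unchanged. If fewer than threshold pages are replaceable,
    enough changed pages are selected to make up the shortfall; they are
    taken in dictionary order, which is an arbitrary choice here.
    """
    replaceable = [p for p, changed in pages.items() if not changed]
    shortfall = threshold - len(replaceable)
    if shortfall > 0:
        changed_pages = [p for p, changed in pages.items() if changed]
        return changed_pages[:shortfall]
    return []


state = {"a": True, "b": False, "c": True, "d": True}
print(pages_to_update(state, threshold=3))   # two more pages must be cleaned
print(pages_to_update(state, threshold=1))   # enough replaceable pages already
```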

The  store  behind  strategy  can  also  be  used  to 
provide  high  reliability  in  the  storage  system. 
Ordinarily,  a  changed  page  is  not  purged  from  a  level 
until  the  next  lower  level  acknowledges  that  the 
corresponding  page  has  been  updated.   We  can  extend 
this  approach  to  require  two  levels  of  acknowledgement. 
For  example,  a  changed  page  is  not  removed  from  M(l) 
until  the  corresponding  pages  in  both  M(2)  and  M(3) 
have  been  updated.   In  this  way,  there  will  be  at 
least  two  copies  of  each  changed  piece  of  information 
at levels M(i) and M(i+1) in the hierarchy. If any
level malfunctions or must be serviced, it can be
removed from the hierarchy without causing any
information to be lost. There are two exceptions to this
process, levels M(1) and M(L). To completely
safeguard the reliability of the system, it may be
necessary to store duplicate copies of information at
these levels. Figure 6 illustrates this process.
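
The two-level acknowledgement rule can be sketched as a simple predicate. The function and its parameters are hypothetical; in particular, capping the requirement near the bottom of the hierarchy is an assumption about how the exception at M(L) might be handled.

```python
def may_purge(level, acknowledged, lowest_level, required_acks=2):
    """Return True if a changed page may be removed from the given level.

    acknowledged is the set of level numbers that have confirmed that
    their copy of the page is updated. The page may go only when the
    next required_acks lower levels have all acknowledged. Near the
    bottom there are fewer lower levels, so the requirement is capped
    at lowest_level (an assumption made for this sketch).
    """
    last_needed = min(level + required_acks, lowest_level)
    needed = range(level + 1, last_needed + 1)
    return all(lower in acknowledged for lower in needed)


print(may_purge(1, {2}, lowest_level=6))      # M(3) has not acknowledged yet
print(may_purge(1, {2, 3}, lowest_level=6))   # both M(2) and M(3) updated
print(may_purge(5, {6}, lowest_level=6))      # only one lower level exists
```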

Distributed  Control 

There is a substantial amount of work required
to manage the storage hierarchy. It is desirable to
remove as much as possible of the storage management
from the concern of the processor. In the
hierarchical storage system described in this paper, all
of the management algorithms can be operated based
upon information that is local to a level or, at most,
in conjunction with information from neighboring
levels. Thus, it is possible to distribute control
of the hierarchy into the levels. This also
facilitates parallel and asynchronous operation in the
hierarchy (e.g., the store behind algorithm).

The independent control facilities for each
level can be accomplished by extending the
functionality of conventional device controllers. Most
current-day sophisticated device controllers are
actually microprogrammed processors capable of
performing the storage management function. The IBM
3850 Mass Storage System uses such a controller [14].